Unsupervised Analysis of Days of Week

Treating each day's hourly crossing counts as a feature vector to learn about the relationships between days.


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn; sn.set()
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from jupyterworkflow.data import get_freemont_data

Get Data


In [2]:
data = get_freemont_data()
data.head()


Out[2]:
East West Total
Date
2012-10-03 00:00:00 4.0 9.0 13.0
2012-10-03 01:00:00 4.0 6.0 10.0
2012-10-03 02:00:00 1.0 1.0 2.0
2012-10-03 03:00:00 2.0 3.0 5.0
2012-10-03 04:00:00 6.0 1.0 7.0

In [3]:
data.resample('W').sum().plot()


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0xc7b9780>

In [4]:
ax = data.resample('D').sum().rolling(365).sum().plot();
ax.set_ylim(0, None);



In [5]:
data.groupby(data.index.time).mean().plot();



In [6]:
pivoted = data.pivot_table('Total', index=data.index.time,
                          columns=data.index.date)
pivoted.iloc[:5, :5]


Out[6]:
2012-10-03 2012-10-04 2012-10-05 2012-10-06 2012-10-07
00:00:00 13.0 18.0 11.0 15.0 11.0
01:00:00 10.0 3.0 8.0 15.0 17.0
02:00:00 2.0 9.0 7.0 9.0 3.0
03:00:00 5.0 3.0 4.0 3.0 6.0
04:00:00 7.0 8.0 9.0 5.0 3.0

In [7]:
pivoted.plot(legend=False, alpha=0.01);



In [8]:
X = pivoted.fillna(0).T.values 
X.shape


Out[8]:
(1610, 24)
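As a rough illustration of the "days as features" idea, the pivot-and-transpose step can be reproduced on a tiny synthetic series. The frame `demo` and its values are made up for this sketch and are not the bridge data; the only assumption carried over is an hourly index with a `Total` column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 3 days of hourly counts 0..71.
idx = pd.Timestamp('2012-10-03') + pd.to_timedelta(np.arange(72), unit='h')
demo = pd.DataFrame({'Total': np.arange(72, dtype=float)}, index=idx)

# Rows become hours of the day, columns become calendar dates.
pivoted_demo = demo.pivot_table('Total', index=demo.index.time,
                                columns=demo.index.date)

# Transpose so each row is one day described by 24 hourly features.
X_demo = pivoted_demo.fillna(0).T.values
print(X_demo.shape)
```

Each row of `X_demo` is one day, which is the layout scikit-learn estimators expect (samples by features).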

Principal Component Analysis (PCA)


In [9]:
X2 = PCA(2, svd_solver='full').fit_transform(X)
X2.shape


Out[9]:
(1610, 2)
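One way to judge whether two components are enough is to look at `explained_variance_ratio_`. The sketch below uses synthetic data (`X_synth`, built from two invented hourly profiles plus noise), so the printed numbers say nothing about the real matrix `X`; it only shows the call pattern.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic (n_days, 24) matrix: two made-up daily profiles plus noise.
rng = np.random.default_rng(0)
hours = np.arange(24)
commute = (np.exp(-0.5 * ((hours - 8) / 1.5) ** 2)
           + np.exp(-0.5 * ((hours - 17) / 1.5) ** 2))
leisure = np.exp(-0.5 * ((hours - 13) / 3.0) ** 2)
profiles = np.vstack([commute] * 70 + [leisure] * 30)
X_synth = 100 * profiles + rng.normal(scale=5, size=profiles.shape)

# Fit the same kind of 2-component PCA used above.
pca = PCA(2, svd_solver='full').fit(X_synth)

# Fraction of total variance captured by each component, and their sum.
ratio = pca.explained_variance_ratio_
print(ratio, ratio.sum())
```

A sum close to 1 would suggest the 2D scatter plot loses little information; a low sum would argue for inspecting more components.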

Unsupervised Clustering


In [10]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(2)
gmm.fit(X)
labels = gmm.predict(X)
labels


Out[10]:
array([0, 0, 0, ..., 1, 0, 0], dtype=int64)

In [11]:
# Plot both PCA components (X2), colored by cluster label
plt.scatter(X2[:, 0], X2[:, 1], c=labels, cmap='rainbow')
plt.colorbar()


Out[11]:
<matplotlib.colorbar.Colorbar at 0xe624978>

In [12]:
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
pivoted.T[labels == 0].T.plot(legend=False, alpha=0.1, ax=ax[0]);
pivoted.T[labels == 1].T.plot(legend=False, alpha=0.1, ax=ax[1]);

ax[0].set_title('Purple Cluster');
ax[1].set_title('Red Cluster');


Comparing with Day of Week


In [13]:
dayofweek = pd.DatetimeIndex(pivoted.columns).dayofweek

In [14]:
# Plot both PCA components (X2), colored by day of week (Monday=0 ... Sunday=6)
plt.scatter(X2[:, 0], X2[:, 1], c=dayofweek, cmap='rainbow')
plt.colorbar()


Out[14]:
<matplotlib.colorbar.Colorbar at 0xf8ce7f0>
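The visual comparison of clusters against day of week can also be tabulated. This is a minimal sketch with hypothetical labels (`labels_demo` simply marks weekends as cluster 1); with the real notebook variables one would presumably pass `labels` and `dayofweek` instead.

```python
import numpy as np
import pandas as pd

# Four full weeks of dates; dayofweek uses Monday=0 ... Sunday=6.
dates_demo = pd.date_range('2012-10-03', periods=28, freq='D')
dow = dates_demo.dayofweek.to_numpy()

# Hypothetical cluster labels: 0 for weekdays, 1 for weekends.
labels_demo = np.where(dow < 5, 0, 1)

# Count how many days of each weekday fall in each cluster.
table = pd.crosstab(labels_demo, dow,
                    rownames=['cluster'], colnames=['dayofweek'])
print(table)
```

Non-zero counts in the weekday columns of the weekend-like cluster are the candidates examined under "Analyzing Outliers".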

Analyzing Outliers

The following points are weekdays with a holiday-like pattern:


In [15]:
dates = pd.DatetimeIndex(pivoted.columns)
dates[(labels == 1) & (dayofweek < 5)]


Out[15]:
DatetimeIndex(['2012-11-22', '2012-11-23', '2012-12-24', '2012-12-25',
               '2013-01-01', '2013-05-27', '2013-07-04', '2013-07-05',
               '2013-09-02', '2013-11-28', '2013-11-29', '2013-12-20',
               '2013-12-24', '2013-12-25', '2014-01-01', '2014-04-23',
               '2014-05-26', '2014-07-04', '2014-09-01', '2014-11-27',
               '2014-11-28', '2014-12-24', '2014-12-25', '2014-12-26',
               '2015-01-01', '2015-05-25', '2015-07-03', '2015-09-07',
               '2015-11-26', '2015-11-27', '2015-12-24', '2015-12-25',
               '2016-01-01', '2016-05-30', '2016-07-04', '2016-09-05',
               '2016-11-24', '2016-11-25', '2016-12-26', '2017-01-02',
               '2017-02-06'],
              dtype='datetime64[ns]', freq=None)
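To check how many of these dates are official holidays, one option is pandas' built-in `USFederalHolidayCalendar`. The sketch below uses only a handful of dates copied from the output above (`sample`); choosing the US federal calendar is an assumption made here, not something the notebook itself does.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# A few of the flagged weekdays, copied from the output above.
sample = pd.DatetimeIndex(['2012-11-22', '2012-11-23', '2012-12-25',
                           '2013-07-04', '2017-02-06'])

# Federal holidays within the range spanned by the sample.
holidays = USFederalHolidayCalendar().holidays(start=sample.min(),
                                               end=sample.max())

# Split the flagged dates into official holidays and everything else.
is_holiday = sample.isin(holidays)
print('federal holidays:', list(sample[is_holiday].strftime('%Y-%m-%d')))
print('other dates:     ', list(sample[~is_holiday].strftime('%Y-%m-%d')))
```

Dates that land in the second group, such as the day after Thanksgiving, are not federal holidays and would need a separate explanation.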